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Abstract 

Markov Chain Monte Carlo (MCMC) methods are employed to sample from 
a given distribution of interest, w, whenever either ir does not exist in closed 
form, or, if it does, no efficient method to simulate an independent sample from 
it is available. Although a wealth of diagnostic tools for convergence assessment 



of MCMC methods have been proposed in the last two decades, the search for a 

o; 

dependable and easy to implement tool is ongoing. We present in this article a 



criterion based on the principle of detailed balance which provides a qualitative 



^ assessment of the convergence of a given chain. The criterion is based on the 
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behaviour of a one-dimensional statistic, whose asymptotic distribution under 
the assumption of stationarity is derived; our results apply under weak con- 
ditions and have the advantage of being completely intuitive. We implement 
this criterion as a stopping rule for simulated annealing in the problem of find- 
ing maximum likelihood estimators for parameters of a 20-component mixture 
model. We also apply it to the problem of sampling from a 10-dimensional 
funnel distribution via slice sampling and the Metropolis-Hastings algorithm. 
Furthermore, based on this convergence criterion we define a measure of effi- 
ciency of one algorithm versus another. 

KEY WORDS: Metropolis-Hastings; slice sampling; Markov chain Central 
Limit Theorem; detailed balance; ergodic Markov chain; equilibrium; station- 
ary distribution. 

1. INTRODUCTION 

Let ir be a given distribution such that either n does not exist in closed form or 
no efficient method to simulate an independent sample from it is available. Sup- 
pose that interest lies in the expected value of a random variable h(X), denoted by 
E„.(/i(X)), where X has distribution n. Monte Carlo sampling methods (Hammersley 
and Handscomb 1964) such as rejection sampling, importance sampling or sampling- 
importance resampling (SIR) approximate the value of K n (h(X)} by sampling from 
a distribution g that closely resembles 7r (Smith and Gelfand 1992). Although for low 
dimensional distributions ix it is oftentimes possible to find sampling distributions g 
that provide estimates to within given accuracy with low computational cost, these 
sampling methods suffer greatly from the curse of dimensionality. 

The need to approximate the value of high dimensional integrals arising in statisti- 
cal mechanics led to the development of MCMC sampling methods. The first MCMC 
method, known today as the Metropolis Monte Carlo algorithm, was proposed by 



Metropolis, Rosenbluth, Rosenbluth, Teller, and Teller (1953) as a general method 
for studying the equilibrium properties of systems consisting of many interacting par- 
ticles. The algorithm simulates the behaviour of the system under equilibrium, and 
the expected value of a given property is approximated by ergodic averages based on 
these simulations. In statistical terms, the Metropolis Monte Carlo algorithm con- 
structs an ergodic Markov chain {X t , t — 1, . . . , n} with stationary distribution it, i.e. 
as the number of iterations n tends to oo, the conditional distribution of X n given 
the value of Xi converges to it regardless of the starting distribution g, where Xi has 
distirubtion g (in notation: X\ ~ g). 

Hastings (1970) generalized the procedure of proposing the next move X t given 
X t _i = Xt-i- His algorithm, known as the Metropolis-Hastings algorithm, trans- 
forms an arbitrary stochastic matrix into a 7r-reversible one, and only requires that 
it be known up to a normalizing constant. An equally popular MCMC algorithm 
is the Gibbs sampler, introduced by Geman and Geman (1984) with an application 
to image restoration. This algorithm proposes the next move by sampling from the 
full conditional distributions and, unlike the Metropolis-Hastings algorithm, accepts 
each proposal with probability 1. Two well-known variants on Gibbs sampling are 
the data- augment at ion algorithm of Tanner and Wong (1987) and the substitution 
sampling algorithm of Gelfand and Smith (1990). 

The goal of MCMC methods is to produce an approximate i.i.d. sample 
{Xk+i,Xk+2, ■ ■ ■ ,Xfc+ n } from it, where K, n > 1, and K is known as the number 
of 'burn-in' iterations to be removed from the beginning of the chain. Analysing 
the output of an MCMC method consists of assessing convergence to sampling from 
it, convergence to i.i.d. sampling, and convergence of empirical averages of the form 
- XlILi h{Xx +i ) to E 7r (/i(X)) = f h{x)-n{x)dx as n — *■ oo. Robert and Casella (2004) 
argue that while convergence to tt is not of major concern since it can only be achieved 



asymptotically, the issues of convergence to i.i.d. sampling and of convergence of em- 
pirical averages are strongly interrelated and depend on the mixing speed of the chain. 
By definition, a chain whose elements converge rapidly to weakly correlated draws 
from the stationary distribution is said to possess good mixing speed. Therefore, the 
mixing speed of a chain is determined by the degree to which the chain escapes the 
influence of the starting distribution and by the extent to which it explores the high 
density regions of the support of 7r. 

Recent research in MCMC methodology has focused on developing, on one hand, 
samplers that escape quickly the attraction of the starting distribution as well as 
that of local modes, and, on the other hand, convergence assessment criteria for 
analysing the mixing speed of a given chain. A recent sampling algorithm which 
exploits the idea of jumping between states of similar energy to facilitate efficient 
sampling is the equi-energy sampler of Kou et a/. (2006). Robert (1995,1998), Cowles 
and Carlin (1996), and Brooks and Roberts (1998) present a comprehensive review of 
the practical implementation of convergence criteria and the mathematics underlying 
them. Liu (2001), Neal (1993), Brooks (1998), and Kass, Carlin, Gelman, and Neal 
(1998) offer an in-depth introduction to MCMC methodology and its applications, as 
well as discussions on the issues surrounding it. 

The common view among researchers and practitioners is that developing a good 
sampler or a reliable convergence criterion is problem-specific. A sampler with good 
mixing speed when sampling from a relatively smooth, low-dimensional distribution 
might become trapped in a well of low probability when sampling from a distribution 
having many local modes. Similarly, a convergence criterion which proves reliable 
for analysing a given MCMC output might incorrectly assess the convergence of a 
chain that has only explored a subset of the entire support space. Our interest lies in 
convergence assessment, in particular, in identifying lack of convergence. We define 



a one-dimensional statistic and derive an intuitive criterion based on the principle of 
detailed balance that provides a qualitative assessment on the convergence of a given 
MCMC chain. 

In Section 2 we recall basic notions and results from the theory of Markov chains, 
which we subsequently use in Section 3 to derive the asymptotic distribution of our 
proposed statistic under the assumption of stationarity. In the same section, we dis- 
cuss two possible implementations of our criterion, one using the asymptotic distribu- 
tion, the other experimental as a qualitative tool. Section 4 discusses two applications: 
one as a stopping rule for simulated annealing, an algorithm for function maximization 
applied to the problem of finding maximum likelihood estimators (Azencott 1992), 
the second as a graphical tool for comparing the performances of Metropolis-Hastings 
versus slice sampling for the problem of sampling from a 10-dimensional funnel distri- 
bution. All computations were performed using code written in C++. We conclude 
in Section 5 with general remarks, comparisons, and criticisms. 

2. PRELIMINARIES 

Let X = {X t , t — 1, 2, . . .} be a Markov chain with state space S and transition 
probability matrix P = (pij). We refer the reader to Medhi (1994), Norris (1997), 
and Jones (2004) for details and proofs. For the purpose of the convergence criterion 
we present in this article, we restrict our attention to finite Markov chains. 

Let p\j be the transition probability from state i to state j in n steps. The 
Ergodic Theorem states that if X is irreducible and aperiodic, then the limits TTj : = 
liuin^ooplj exist and are independent of the initial state i for all i,j G S and (ttj, j G 
S) is the stationary distribution of X. The chain X is called ergodic. 

Definition 1 (Principle of detailed balance) Transition probability matrix P and 
probability distribution n are said to be in detailed balance, or, equivalently, the prin- 
ciple of detailed balance is said to hold, if ' itiPij = ^jPji V«, j G S. 
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Definition 2 A Markov chain X with irreducible transition probability matrix P 
and initial distribution g, i.e. X\ ~ g, is reversible if, for all N > 2, the chain 
{Xn,Xn-i, ■ ■ ■ ,X2,Xi} is a Markov chain with transition probability matrix P and 
initial distribution g. 

Norris (1997) proves that if X is irreducible, then it is reversible if and only if P 
and g are in detailed balance, where g is the initial distribution of X. The following 
definitions are needed to introduce the Markov chain Central Limit Theorem (Jones 
2004). 

Definition 3 Let M(i) be a nonnegative function and ^{n) a nonnegative decreasing 
function on the positive integers such that 

||P n (v)-7r(-)||<M(z)7(n). (1) 

Let X be a Markov chain on state space S with transition probability P and stationary 
distribution n. If (QP holds for all i G S with ^(n) = t n for some t < 1, then X is 
geometrically ergodic. If, moreover, M is bounded, then X is uniformly ergodic. If 
(Q]j holds for all i G S with 7(n) = n~ m for some m > 0, then X is polynomially 
ergodic of order m . 

Theorem 1 The Central Limit Theorem (finite state space) Let X be an ergodic 
Markov chain on state space S with stationary distribution n. Let h : S — » R be a 
Borel function. Assume that one of the following conditions holds: 

1. X is polynomially ergodic of order m > 1, E n M < 00 and there exists B < 00 
such that \h(X)\ < B almost surely; 

2. X is polynomially ergodic of order m, E n M < 00 and J5 7r (|/i,(.X')| 2+<5 ) < 00 
where m5 > 2 + 5; 
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3. X is geometrically ergodic and E ir (\h(X)\ 2+s ^j < oo for some 5 > 0; 
4- X is geometrically ergodic and E n (h 2 (X)\log + |/i(X)|]) < oo; 

5. X is geometrically ergodic, satisfies detailed balance and E n h 2 (X) < oo; 

6. X is uniformly ergodic and E n (h 2 (X)) < oo. 
Then for any initial distribution, 

\/n[ h n — E^{h{X)) j — ► Normal (0, a%) as n — ► oo ; 

where h n = \ YTi=\ K X i) an d a l = Y&1 ^ {K x i)) + 2 Ya=2 cov ^ (M^i)> H x i)) < °°- 
3. DETAILED BALANCE AND CONVERGENCE DIAGNOSTICS 

Let 7r = (7Tj, i G S) be a discrete distribution with finite state space 5, m = \S\. 
Let {X t , t = l,...,n} be an irreducible, aperiodic Markov chain with transition 
probability matrix P = (pij) and stationary distribution it. We say that a chain 
has reached equilibrium by step t if P f (i,j) = ttj, Vi,j G S and 3i,j G 5 such that 
P tl {hj) 7^ 7Tj. Our convergence assessment criterion is based on the principle of 
detailed balance from statistical mechanics (Chandler 1987). Statistical mechanics is 
concerned with the study of physical properties of systems consisting of very large 
number of particles, for example liquids or gases, as these systems approach the 
equilibrium state, i.e. a uniform, time-independent state. In these terms, the principle 
of detailed balance states that a physical system in equilibrium satisfies 

TTi _ Pji _ ( E t - Ej 



exp -^— ^ , W,jeS, 
7Tj p^ V kl / 

where E^ is the energy of the system in state i, k is Boltzmann's constant, T is the 
temperature, and 7Tj and p^ have the usual interpretation. 



We assume that the Markov chain {X t , t = 1, . . . ,n} is constructed to satisfy 
detailed balance. This is oftentimes the case since the principle of detailed balance 
implies that it is the stationary distribution of the chain, and it is easier to check the 
former than the latter, see for example the discussions on the Metropolis-Hastings 
(Hastings 1970) and slice sampling algorithms (Neal 2003). We introduce the no- 
tion of an energy function E{ oc — log(7Tj), Vi G S. When implementing simulated 
annealing, the stationary distribution at temperature T^ is 7r 1//Tfc , so the energy func- 
tion becomes Ei = — logln^/Tj,, where {Tj,, k — 1, 2, . . .} is a sequence of decreas- 
ing temperatures. Therefore, the equilibrium probability of being in state i equals 
7Tj = -|exp(— Ei), where the normalizing constant is defined as Z := X]ie,s ex P( — ^i)- 
Define the following approximation to 7Tj based on a Markov chain of n iterations 



1 n 

-^2l(Xj = i), VieS. 



Ki = 

n . 

The idea of working with indicator functions is similar to that of Raftery and 
Lewis (1992) who develop a convergence assessment method based on the sequence 
{I(X t < i), t — 1, . . .}, for fixed i G S. We point out that, for fixed i G S, the sequence 
{I(X t = i), t = 1, . . .} forms a Markov chain, whereas the sequence defined by 
Raftery and Lewis does not. Brooks et al. (2003) use a similar approach of estimating 
the stationary distribution by the empirical distribution function obtained from the 
MCMC output; they derive nonparametric convergence assessment criteria for MCMC 
model selection by monitoring the distance, as the number of simulations increases, 
between the empirical mass functions obtained from multiple independent chains. 

Our criterion assesses the convergence of the chain by comparing the behaviour 
of the functions fi = 7Tj/exp(— Ei), i G S, to their average /= m^jesfji v * a ^ e 
statistic V n := ^ E ie <? {fi~P) 2 - 

3.1 Theoretical approach 



We proceed to derive the distribution of the statistic V n under the hypothesis that 
the chain has reached stationarity, i.e. that Xi ~ ir, Vi = 1, . . . , n. 



Vn 



m *— ' m *-^ I m *— ' m m ^— ' I m *— ' 

ies k jes ^ ies k jes ^ ies 






where f = (f;,2 G ST and a; = ( — —,..., — — , 1 — -, — — , . . . , — — ) is an m- 
dimensional column vector with ith entry equal to 1 — — and the remaining entries 
equal to — — . Define the following (m x m) dimensional matrix 

/ &1 >\ I 1-i -A. -J, -i \ 

a 2 ' 



V a m ' / 

BoK = ^{Af}'{Af}. 

First, we observe that Vi G S 1 , 
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(£ - E*/;, j G S^a/ = (l - 1) [/< - E^/i] - ^ E (/; - E ^) = /* - /' ( 2 ) 
since E^/j = 4, Wj E S. Second, we notice that 
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Vi G 5. 



(3) 



Define Wi jn := \/n(jti — 7Tj), Vi G S 1 , and the m-dimensional column vector W n 
(Wi, n ,i G S)'. From $3) and (ED, we obtain that V n = {CW n }' {CW n } , where 
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The following result presents the asymptotic distribution of the statistic V n under the 
assumption of stationarity. 
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Theorem 2 Under the conditions of Theorem d CW n — ► Normal (0, CSC) and 
V n — > X^t=i ^«^ 2 as n ~~ * °°> where Ai, . . . , A& are rjne characteristic roots of CSC 
and Zi, . . . , Zfc are i.i.d. Normal(0, 1) random variables. 

proof: We begin by pointing out that irreducible and aperiodic Markov chains on 
finite state spaces are uniformly ergodic (Roberts and Rosenthal 2004), so condition 
(6) of Theorem [T] is satistifed. It follows that for every i G S, 

W i>n = v^ - n t ) = v^{ - J2 f ( X 3 = - E. {I(X 1 =ij)\Z Normal (0, a}) 

as n — > oo, where 

oo 

a] = 7Ti(l - 7i t ) + 2 Y, [P{HXj =i) = l|I(^i = «) = IK - ^] < °°- 

i=2 

By the Cramer- Wold Device (Billingsley 1968, Varadarajan 1958), it follows that 
W n — > Normal (0, S) as n — > oo, where is an m- dimensional column vector of zeros 
and S is an (m x m) variance-covariance matrix whose entries are given 



S(i,i) = cr? 



2 



f 1 n n 1 

E(i, j) = lim cov w (W ijn , W^) = lim <^ - V V cov^ (l(X fc = z), I(*j = j)H 

*> fc=l J=l ' 

r i n 1 " 

= ]l™ - E H Xfc = *> x * = rf - H + ^ E H x * = « = # 

^ fc=l fc,Z=l 

+ ^E[ p {^ = ^^=^}-H} 



-TTiTTj 
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So, for all i,jES,iy£ j 






( n 
T,(i,j) = -TiiTTj + lim — <^ V^ \P{X l = j\X k = i\ - iij 

^ k,l=l 

k<l 


71 

+ Y,[P{Xi=j\X k = i}-Ti 3 


} 


fej=i 

Kk 

oo 

= -Ti.Tij + 2tt, J2 [ P { X k = j\Xi 


= i}~ Tlj 


< oc 



fc=2 

The last equality follows from the fact that if a Markov chain satisfies detailed balance, 
then it is reversible, i.e. for k > 1, P{Xi s = j\X\ = ?} = P{X\ = j\X k = «}. Finally, 
the conditions of the Markov chain Central Limit Theorem guarantee that the infinite 
summation in the last line is finite. 

It then follows that CW n — ► Normal (0, CSC) as n — ► oo. Lastly, since V n = 
{Clf„} {CW 7 ^}, it follows from Lemma 1 in Chernoff and Lehmann (1953) that 
V n — ► Xli=i -^i-^j 2 as n -^ oo, where Ai, . . . , A*, are the characteristic roots of CSC 
and Zi, . . . , Z k are i.i.d. Normal(0, 1) random variables. 

Q.E.D. 

Example 1 Let the Markov chain be generated by the Metropolis-Hastings algo- 
rithm with symmetric proposal probability matrix P = (pij). The expressions for 
E(i, i) and S(i,j) can be simplified as follows. Consider the Markov-Bernoulli chain 
|l(Xj = i), j = l,...,n} for fixed i G S with transition probability matrix Pi = 

1 — a a \ 

. It is shown in Medhi (199 % pp. 101-102) that 
b 1-6 J 

Pi-* = -L-( b *V (i-*-»)W ° - hVj>2 . 

a + b\ b a a + b 1 _ b b 
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Now, 

% P ^ Xl = h X2 = l l P{X 2 = i}- PjX, = i,X 2 = ^} 

( ' 1 - P{X 1 = i} 1 - P{X l = i} l-rci 

b = l-P{X 2 = i\X 1 = i} = l- Pii . 



TV.: 



(l-Pit)' 



Then, provided that max{0, 2-Ki — 1} < pa < I, \/i G S, 
£(z,i) = 7r i (l-7r i ) + 2 2^7ri(l-7Ti)(^^ ) = 7— , 

J— ^ 

oo 

E(i,j) = -7r i 7r J H-27r i ^(p fe - 1 (i,j)-^), /ori^j. 

fc=2 

3.2 Implementation 

Let {Xk+i, Xk+2, ■ ■ ■ , Xx+n} be an irreducible and aperiodic Markov chain with 
finite state space S and stationary distribution n that satisfies detailed balance. A 
burn-in of K draws are discarded, where K depends on the rate of convergence of the 
sampling algorithm on 7r (Brooks 1998). We implement our convergence assessment 
criterion as a test of hypothesis under the null hypothesis that the chain has reached 
stationarity by iteration K + 1. 

For n large enough, V n = ^2 i=1 \Zf, and we estimate its distribution using Lya- 
punov's Central Limit Theorem (Loeve 1963). Since Zi is Normal(0, 1), Zf is ^L, so 
E(XiZf) = \i and var {\Zf) = 2A 2 , for i = 1, . . . , k. Define Y t = X^Zf-Xf, E(Y<) = 0, 
and var(Fi) = E(F. 2 ) = 2A 2 < oo for i = 1, . . . ,n. Moreover, E(Y. 3 ) = -4A 3 < oo, 
so E|Fj 3 | < oo, for i — 1, . . . , k. Define s 2 = Y^i=i var (Y) — 2 X^ =1 ^f ■ ^ remains 

i l l Q 

to show that the following condition holds: lim^oo ^ i=1 E7j / s\ = 0, which is 
equivalent to showing that 

1 k 

lim \^|AJ 3 = 0, (4) 



A— x ('•) V A ' \2^ 3 / 2 



^EtxAf) 
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since E|F;| 3 = |A 4 | 3 E|Z.f - l| 3 w 8.6916|Ai| 3 , fori = l,...,k. So, provided that 
condition (jl]) is satisfied, Lyapunov's Central Limit Theorem gives the following result 
for k and n large enough: 

k / k k 



V n = y, ^i%i ~ Normal I 2_, ^U 2 /, tf ) approximately. 

i=l ^ i=l i=l ' 



(5) 



For the computation of the mean and variance in (jSJ), we resort to the following 
simplifications 

k m 

J2^ = trace(CSC') = ^ [C(M)] 2 £(M), (6) 

4=1 i=l 

£\ 2 = (E A 2 " 2 S> A - (7) 

i=l i=l »,j=l 

where the first summation in equation ((7|) is given in ([6]), and the second is the sum 
of all the 2-square principal subdeterminants of CSC (Marcus and Ming 1964, p. 
22). 

We propose a quantitative assessment of convergence via a test of hypothesis at 
confidence level (1 — a) using the approximate distribution of V n given in (jSJ) as 
follows. 

1. Obtain an aperiodic, irreducible Markov chain which satisfies the principle of 
detailed balance: {Xi, X 2 , . . . , Xk, • • • , Xx+n}', discard the first K draws. 

2. Compute the statistic V n — — ^2 ieS (fi — f) from the remaining n draws and 



the (1 - a/2) quantile v a/2 = Y!l=i \ + z <*/2 \J 2 YX=i \ 2 - 

3. If V n < v a /2, conclude that the chain has reached stationarity at level (1 — a) 
and stop; else, continue for an additional n iterations and return to step 2, 
replacing n by In. 
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In this article we implement the criterion in the form of a qualitative tool for 
convergence assessment. We iterate the chain and plot the absolute value of the 
relative difference, | (V^-i) n — Vkn) /V(k-i)n \ , against the number of iterations kn, 
every n iterations, k = 1,2,.... We claim that the chain has reached equilibrium 
if the relative difference drops below some problem-specific, pre-specified constant 
e > 0. The value of the constant e is problem-specific because it depends on the 
distribution of interest n. For a high-dimensional, multi-modal distribution, the value 
of e might need to be very small in order for this analysis to correctly detect lack 
of convergence to n, whereas the same value might be too conservative for a one- 
dimensional, unimodal distribution. 

Based on this implementation of the criterion as a qualitative tool, we can define 
a measure of efficiency of one algorithm against another. Let e > be given. Let 
Vn be the value of the statistic after n iterations of algorithm i, i — 1, 2. Let rii 
represent the interval, in iterations, at which the statistic is computed for algorithm 
i. The measure of efficiency is defined as 

v(e) = ^{kn.:\(V^ 1)ni -V^)/V^ 1)ni \<e} 

^ min {^: |(^! 1)n2 -n ( 3/<-i)nJ <e }' 
If V12 < 1, we conclude that algorithm 1 is more efficient than algorithm 2 at level 
e; if V12 > 1) algorithm 2 is more efficient than algorithm 1. 

4. APPLICATIONS 

4.1 Application 1: multipath changepoint problem 

The following application is taken from Asgharian and Wolfson (2001). Let Yy de- 
note the jth measurement on patient i, where 1 < i < 100, 1 < j < 20. To 
each patient there is associated a possibly distinct changepoint r, such that mea- 
surements Yn, Y i2 , . . . , Y ir . are i.i.d. Normal(0, 1) random variables and measurements 
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Y iT . + i, . . . , Y i2 o are i.i.d. Normal(4, 1). Let Zj = (1, Zn)' and 9 = (6q, 9i)' denote the 
covariate vector and the regression coefficient vector, respectively, for patient i, i.e. 
Yij =9o + OiZn, Vj. Define parameters a = 8q + 0\ and j3 = 9q — Q\. The goal is 
to find the maximum likelihood estimators (MLE's) of a and /3, denoted by a and (3, 
respectively. We simulate the data with 6q = and Q\ = 1; the joint log likelihood 
is bimodal. We let the parameter space be (—10, 10) 2 , assuming zero mass is placed 
outside this region, and we discretize the space over a grid of width 0.01. 

We apply the algorithm of simulated annealing, introduced by Kirkpatrick, Gelatt, 
and Vecchi (1983), which performs function optimization through an iterative im- 
provement approach. The algorithm was developed via an analogy with thermody- 
namics where a substance is melted by a slow annealing process and equilibrium is 
attained at each temperature until eventually the substance stabilizes at its lowest- 
energy state. Similarly, in simulated annealing, a global temperature parameter con- 
trols the effects of high probability regions under the distribution of interest it. For 
each Tk in a sequence such that T& — » as k — » oo, an MCMC chain with station- 
ary distribution 7r 1//Tfe is generated until equilibrium. As the temperature is lowered 
following a pre-specified schedule, known as the cooling schedule, the effects become 
more pronounced and the chain stabilizes at its global maximum value or equivalently, 
lowest energy state (Neal 1993, Brooks and Morgan 1995). Geman and Geman (1984) 
show that this convergence is guaranteed under a logarithmic cooling schedule, which 
unfortunately is too slow to be followed in practice. 

We implement the algorithm with a geometric cooling schedule Tk+\ = T^/2, k = 
0, . . . , 5, and To = 50 and zero burn-in. Simulated annealing with a very fast cooling 
schedule is known as simulated quenching; refer to Catoni (1992) for a discussion 
on the design of cooling schedules. For (a, (3) E (— 10, 10) 2 , the function f^l at 
temperature T k is given by /^ = 7r a)/? /exp(-£ , ( a)/9) ). 
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The aim is to compare the performance of the Metropolis-Hastings sampler in de- 
termining the MLE's via simulated annealing with two different methods for proposing 
the next move. In the first method, we draw uniformly from a cube of length w cen- 
tered at the current position, where w has the values: {12,7,4,2.5,1.7,1.2,0.9,0.6} 
for k = 1, . . . , 8. These values are set retrospectively to obtain an acceptance rate of 
approximately 0.4. In the second method, we propose the next move via univariate 
slice sampling applied to each variable in turn; this algorithm is described briefly in 
Subsection 4.2. We use the "stepping-out" procedure with an initial interval size of 
0.1 at each temperature. 

At each temperature, we perform 1000 iterations of the Metropolis-Hastings algo- 
rithm, computing the value of V n every 25 iterations. We obtain the following results: 
(a«/3«) = (1.18,-1.17), E (&(1)>$m) = 247.645, and (a® J^) = (1.19,-1.15), 
E, & (2) 3(2)) = 247.645 for the first and second methods, respectively, which equal the 
lowest energy value obtained by a systematic grid search. We conclude that both 
methods correctly identified the MLE's. Figures Q] and [2] display the relative differ- 
ence in variance; sharp drops indicate that the sampler has jumped to previously 
unexplored regions of the parameter space, i.e. to points (a, (3) for which % a ^ is 
significantly different from 7T aj/ g, thus increasing the value of the variance. 

We proceed to simulate 50 datasets; for each, we initialize the two chains from 
the same randomly chosen point. At each temperature level, we compute the value 
of V n every 25 iterations until | (V(k-i)n — ^fen)/^(fe-i)n < e ) with e = 0.05. We 
remark that this value of e is very conservative; ideally, a different value would be 
employed at each temperature level. We make the following two observations: first, 
for any given dataset, the lowest energy values reported by the two algorithms dif- 
fer by at most 0.011 units in magnitude, and, second, the difference between the 
lowest energy values found by a systematic search and by simulated annealing is at 
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Figure 1: Relative difference in V n versus n using uniform proposal distributions for 
application 1. The plots show the decreasing trend of the relative difference in V n as 
the number of iterations increases, interrupted by sharp increases in V n . 

most 0.614909. Moreover, we note that the methods required on average 5605 it- 
erations, and 3162 iterations, respectively. Averaged over 50 tests, the measure of 
efficiency of simulated annealing using Metropolis-Hastings with uniform proposals 
versus Metropolis-Hastings with slice sampling is approximately 1.77, i.e. MCMC 
with slice sampling is almost twice as efficient as MCMC with uniform proposals. 

4.2 Application 2: 10-dimensional funnel 

Neal (2003) illustrates the advantage of slice sampling over Metropolis-Hastings in 
sampling from a 10-dimensional funnel distribution. Slice sampling is an adaptive 
MCMC method which proceeds in two alternating steps. Given the current position 
X t = x t , it samples a value y uniformly from the interval (0, n(xt))- Given y, the next 
position X t+ i is sampled from an appropriately chosen subset of the horizontal "slice" 
{x;n(x) > y}. Neal (2003) shows that the algorithm produces an ergodic Markov 
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Figure 2: Relative difference in V n versus n using slice sampling for application 1. 
The plots show the decreasing trend of the relative difference in V n as the number of 
iterations increases; the increases in V n are more frequent than in Figure [TJ 

chain with stationary distribution tt, and that, moreover, due to its adaptive nature, 
the algorithm sometimes outperforms Metropolis-Hastings and the Gibbs sampler. 

Let X be a Normal(0,9) random variable, and let Yi, . . . ,Y 9 be independent 
Normal random variables, which, conditional on X = x, have mean and variance 
exp(x). The goal is to obtain an approximate independent sample from the joint 
distribution of (X, Yx, . . . , Yg). We initialize the chain as follows: X = and Y± = 1, 



for i 



9. For each variable, the parameter space is taken to be (—30.0, 30.0) 



and it is discretized over a grid of width 0.01. 

First, we implement the Metropolis-Hastings algorithm with single- variable up- 
dates applied to each variable in sequence; one iteration of the chain consists of 1300 
updates. For each variable, the proposal distribution is Normal, centered at the cur- 
rent value, with standard deviation of 1.0, truncated on the interval (—30.0,30.0). 



Metropolis-Hastings 



Metropolis-Hastings 



1000 2000 3000 4000 




Figure 3: Sampled values and relative difference in V n in application 2. The left col- 
umn displays histograms of the sampled values of X superimposed on the Normal(0, 9) 
density function. The right column displays the relative difference in V n versus n. 

Numbers are rounded to the closest value on the grid. Second, we implement the 
slice sampling algorithm with single-variable updates; each iteration consists of 120 
updates for each variable in sequence. We use the "stepping-out" procedure with an 
initial interval of size 1. We compute V n every 100 iterations until the absolute value 
of the relative difference is below e = 0.01. 

The left column of Figure [3] compares the histograms of the sampled values of X 
with the true probability distribution function; the histograms are based on chains 
of 4600 and 17200 iterations, respectively. Metropolis-Hastings oversamples negative 
values of X and undersamples positive ones; slice sampling samples correctly in the 
left tail of the distribution, but undersamples positive values. The right column dis- 
plays the behaviour of the relative difference in V n ; the variance function undergoes 
sharp increases in value under both sampling methods, but stabilizes towards the end 
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Figure 4: Autocorrelation of X in application 2. Slice sampling has a faster rate of 
convergence than Metropolis-Hastings evidenced by the smaller autocorrelation. 



Metropolis-Hastings 



Slice sampling 




5000 10000 15000 20000 25000 30000 
iteration 



5000 10000 15000 20000 

iteration 



Figure 5: Relative difference in V n versus n for eleven parallel chains in application 
2. The value of V n under Metropolis-Hastings sampling seems to be more stable than 
under slice sampling. 
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of the run. The behaviour of the variance function fails to reflect the incorrect sam- 
pling in the tails of the distribution. The plot of the relative difference in variance for 
the Metropolis-Hastings algorithm indicates that a smaller value of e would be more 
appropriate for assessing convergence. The plots in Figure H] show that the autocor- 
relation obtained by slice sampling remains close to zero after 100 iterations, whereas 
that obtained by Metropolis-Hastings continues to fluctuate even after 1000 itera- 
tions. This indicates that the Metropolis-Hastings algorithm converges more slowly 
than slice sampling. We compute the Raftery and Lewis (1992) convergence diagnos- 



tic using the Coda package in R (http://www.r-project.org) obtaining dependence 



factors of 14 and 18.7 for the Metropolis-Hastings and the slice sampling algorithms, 
respectively, indicating strong autocorrelation. 

Finally, we run eleven parallel chains started from the following quantiles of the 
marginal distribution of X : {0.1, 0.2, 0.3, 0.4, 0.45, 0.5, 0.55, 0.6, 0.7, 0.8, 0.9}; we em- 
ploy the value e = 0.01. We expect the parameter space to be insufficiently explored 
by both algorithms; however, we are interested in whether this insufficient exploration 
can be detected from the behaviour of V n across chains with overdispersed starting 
points. Pooling the sampled values results in chains of 30800 and 19800 draws, re- 
spectively; thus the measure of efficiency of Metropolis-Hastings versus slice sampling 
is 1.56. Trace plots and histograms indicate that negative values of X are oversam- 
pled and positive ones are undersampled by both algorithms. Figure [5] is obtained 
by pooling the sampled values across the eleven chains; the behaviour of V n under 
slice sampling poses signs of concern regarding convergence to stationarity (notice the 
frequent increases in value from iteration 17500 onwards), whereas the value of V n 
under Metropolis-Hastings appears stable towards the end of the run. Therefore the 
behaviour of V n under slice sampling across eleven chains with overdispersed start- 
ing points indicates lack of convergence to stationarity, whereas the behaviour of V n 



21 



under Metropolis-Hastings, which is known to allow a more restrictive exploration of 
the support space, gives misleading results. 

5. CONCLUSION 

The last fifty years have witnessed the development and rise in popularity, in par- 
ticular in Bayesian statistical inference, of Markov Chain Monte Carlo methods for 
simulating from complex probability distributions (Smith and Roberts 1993). For a 
practitioner who has a finite MCMC output, questions arise regarding how reliable 
the sample is as a representation of it. Although a wealth of convergence diagnostic 
tools for analysing MCMC output have been proposed over the past decades, their 
performance, in general, is problem-specific, and developing a dependable, easy to 
implement tool for convergence assessment continues to be a challenge. This arti- 
cle presents a new convergence assessment method for irreducible, aperiodic Markov 
chains on discrete spaces obtained by MCMC samplers that satisfy the principle of 
detailed balance and requirement fll]). We introduce a one-dimensional test statistic 
whose behaviour under the assumption of stationarity is analyzed both theoretically 
and experimentally, and present a possible implementation of our criterion as a graph- 
ical tool for convergence assessment. 

In low dimensional problems, the proposed criterion as a qualitative tool assesses 
convergence satisfactorily; however, in high dimensional problems, the criterion is 
unreliable for convergence assessment, but can provide useful insight into lack of 
convergence of the chain to stationarity. In particular, if the variance function expe- 
riences sharp increases in value, then it can be concluded that stationarity has not yet 
been reached; however, if the value of the variance function is stable, then the results 
are inconclusive. The advantage of our method lies in its attempt to analyse the 
behaviour of an MCMC chain travelling through a possibly high dimensional space 
by monitoring the behaviour of a one-dimensional statistic. Lack of convergence to 
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stationarity is correctly assessed by the behaviour of the statistic to the extent to 
which the sampler explores freely the underlying space. Particularly in high dimen- 
sional problems with irregularly shaped distribution functions, we recommend that 
the MCMC output be analyzed using different e values, compared across multiple 
chains, and that several diagnostic tools be employed. 

There exist in the literature at least two convergence assessment criteria based on 
weighting functions that are very similar to our approach. Ritter and Tanner (1992) 
propose to detect convergence to the full joint distribution by monitoring conver- 
gence of the importance weight Wt = 7r(x)/g t (x), where gt is the joint distribution of 
the observations sampled at iteration t. They estimate gt(x) by ^ YULiP( x \ x t-i) i 
where a^_ 1? i = l,...,m is a sample from gt-i- If the chain has converged, the 
distribution of the weights Wt, based on multiple replications of the chain, will be 
degenerate about a constant. Zellner and Min (1995) propose a convergence criterion 
for the Gibbs sampler in the special case that x can be partitioned into (x^,x^))- 
They define two criteria based on the weight functions W\ = p{x(i))p(x(2)\x(\)) — 
p{x {2 ))p{x(i)\x {2 )) and W 2 = [p(x {1) )p(x {2) \x {1) )]/[p(x { 2))p(x(i)\x(2))], where p {1) is es- 
timated by ^ YlT=i p{ x (i) l x (2)) ' an d x (2)' j = 1, ■ ■ ■ ,TTiis the sequence of draws of X{2) 
obtained by Gibbs sampling. They compute the value of these weights at many points 
in the parameter space and argue that if the chain has converged, then the values of 
W\ will be close to and those of W 2 close to 1. Zellner and Min use asymptotic 
results from the stationary time series literature to calculate posterior odds for the 
hypothesis Hq : W\ = vs. Hi : W\ ^ for the /c-dimensional case, k > 1, when 
the weights are computed at k different points in the parameter space. 

The main drawback of these methods is the assumption that the transition proba- 
bility p(x\x t _i), in the method of Ritter and Tanner, and the conditionals p(x(i)\x(2)) 
and p{x{2) |^(i)), in the method of Zellner and Min, exist explicitly. Our method, how- 
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ever, makes no such assumption and estimates 7Tj, the probability of being in state i, 
by the empirical distribution function. All three methods have the disadvantage of 
being computationally expensive; the ergodic averages used to approximate various 
marginal and conditional probabilities (in our method, 7r$) require a large number of 
summands in order to provide good estimates, so large numbers of iterations, and 
possibly many replicates of the chain, are needed. Furthermore, since the normaliz- 
ing constant of 7r is unknown, the functions f\ and the weights w t of the criterion of 
Ritter and Tanner might stabilize around an incorrect value if the sampler has failed 
to explore all the high density regions of the space. For this reason, we recommend 
to run multiple replicates of the chain started from different regions of the space. 
The criterion of Zellner and Min also gives misleading results if the space is poorly 
explored and the weights are computed at points that come from low density regions. 
Finally, our criterion has an intuitive graphical representation, very similar to that 
proposed by Ritter and Tanner, and, whereas the criterion of Zellner and Min uses 
multivariate weight functions, our criterion is based on a one- dimensional statistic 
regardless of the dimension of the underlying space, thus offering a dimensionality 
reduction approach to the problem of convergence assessment in high dimensional 
spaces. 

An interesting alternative to approximating a continuous state space by a discrete 
grid is to sample the continuous state-space Markov chain and to apply the discretiza- 
tion method developed by Guihenneuc-Jouyaux and Robert (1998). Provided that 
the continous chain is Harris-recurrent, the method defines renewal times based on 
the visiting times to one of m disjoint small sets in the support space. By subsampling 
the underlying chain at the renewal times, the method builds a homogeneous Markov 
chain on the finite state space {l,...,m}. Our propoposed criterion can then be 
applied to the finite chain; it would be interesting to explore whether the convergence 
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assessment extends to the continous Markov chain. 
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